PIma indians diabetes

Pima indians diabetes dataset

Target (dependent) variable, Outcome.

Independent variables include the followings
- Number of times pregnant
- Plasma glucose concentration after 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Body mass index (weight in kg/(height in m)^2)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Diabetes pedigree function

Import data

Code
df = read.csv("data/diabetes.csv")
head(df)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
1           6     148            72            35       0 33.6
2           1      85            66            29       0 26.6
3           8     183            64             0       0 23.3
4           1      89            66            23      94 28.1
5           0     137            40            35     168 43.1
6           5     116            74             0       0 25.6
  DiabetesPedigreeFunction Age Outcome
1                    0.627  50       1
2                    0.351  31       0
3                    0.672  32       1
4                    0.167  21       0
5                    2.288  33       1
6                    0.201  30       0

Exploration

Code
df$Outcome = factor(df$Outcome)
summary(df)
  Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
 Outcome
 0:500  
 1:268  
        
        
        
        

Replace zero w/ mean (I)

replace zero w/ NA, then replace NA w/ mean
first, replace zero w/ NA
- dplyr::mutate, across, na_if

Code
library(dplyr) 
df2 = df %>% 
  mutate(across(2:6, ~na_if(.x, 0))) 
summary(df2)
  Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
 Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
 3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
                  NA's   :5       NA's   :35       NA's   :227    
    Insulin            BMI        DiabetesPedigreeFunction      Age       
 Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
 1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
 Median :125.00   Median :32.30   Median :0.3725           Median :29.00  
 Mean   :155.55   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
 NA's   :374      NA's   :11                                              
 Outcome
 0:500  
 1:268  
        
        
        
        
        

Replace zero w/ mean (II)

then, replace NA w/ mean
- dplyr::mutate, across, replace_na - mean

Code
library(tidyr)
df3 = df2 %>% 
  mutate(across(2:6, ~replace_na(.x, round(mean(.x, na.rm = T)))))
summary(df3)
  Pregnancies        Glucose       BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
 1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00  
 Median : 3.000   Median :117.00   Median : 72.00   Median :29.00  
 Mean   : 3.845   Mean   :121.69   Mean   : 72.39   Mean   :29.11  
 3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
 1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
 Median :156.0   Median :32.00   Median :0.3725           Median :29.00  
 Mean   :155.8   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:156.0   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
 Outcome
 0:500  
 1:268  
        
        
        
        

EDA: distribution

Code
library(GGally) 
ggpairs(df3)

EDA: correlation coefficients

Code
library(ggcorrplot) 
r = cor(df3[,-9]) 
pmat = cor_pmat(r) 
ggcorrplot(r, hc.order = TRUE, type = "lower", lab = TRUE, p.mat = pmat)

Fit logistic regression

Code
lr_fit = glm(Outcome ~ ., data = df3, family = "binomial") 
summary(lr_fit)

Call:
glm(formula = Outcome ~ ., family = "binomial", data = df3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6712  -0.7181  -0.3953   0.7124   2.3761  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -9.0899172  0.8121027 -11.193  < 2e-16 ***
Pregnancies               0.1251221  0.0323909   3.863 0.000112 ***
Glucose                   0.0373449  0.0038787   9.628  < 2e-16 ***
BloodPressure            -0.0089408  0.0085602  -1.044 0.296272    
SkinThickness             0.0032461  0.0131290   0.247 0.804715    
Insulin                  -0.0007828  0.0011740  -0.667 0.504897    
BMI                       0.0933558  0.0178437   5.232 1.68e-07 ***
DiabetesPedigreeFunction  0.8663243  0.2963963   2.923 0.003468 ** 
Age                       0.0132023  0.0095108   1.388 0.165097    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 713.23  on 759  degrees of freedom
AIC: 731.23

Number of Fisher Scoring iterations: 5

Stepwise feature selection

to find the subset of variables for the best performing model
- forward selection
- backward elimination
- stepwise selection(mixed)

Code
library(MASS)
step_model = stepAIC(lr_fit, direction = "both", trace = F) 
summary(step_model)

Call:
glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
    family = "binomial", data = df3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8235  -0.7243  -0.3992   0.7253   2.4350  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -9.189830   0.705765 -13.021  < 2e-16 ***
Pregnancies               0.143397   0.027547   5.206 1.93e-07 ***
Glucose                   0.036917   0.003491  10.576  < 2e-16 ***
BMI                       0.088693   0.014726   6.023 1.72e-09 ***
DiabetesPedigreeFunction  0.882700   0.294785   2.994  0.00275 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 716.13  on 763  degrees of freedom
AIC: 726.13

Number of Fisher Scoring iterations: 5

Forest plot

Code
library(sjPlot) 
plot_model(step_model, sort.est = T)

Visualize regression model

Code
library(visreg) 
library(ggpubr) 
p = visreg(step_model, scale = "response", gg = T) # gg = T; ggplot2 as the engine 
ggarrange(plotlist = p)

Evalulate model performance

Code
prob = predict(step_model, newdata = df3, type = "response")
pred = rep(0, length(prob))
pred[prob > 0.5] = 1
mean(pred == df3$Outcome); table(pred, df3$Outcome)
[1] 0.7721354
    
pred   0   1
   0 443 118
   1  57 150

Linear discriminant analysis

Code
lda_fit = lda(Outcome ~ ., data = df3) 
plot(lda_fit)

Evaluate model performance

  • accuracy = 1- error rate
  • confusion matrix
Code
pred = predict(lda_fit, newdata = df3) 
pred = pred$class
mean(pred == df3$Outcome); table(pred, df3$Outcome)
[1] 0.7760417
    
pred   0   1
   0 445 117
   1  55 151